Generalized linear models

So far we have used linear regressions, where we modeled the outcome as follows:

\[ y \sim Normal(\mu,sigma) \\ \mu = \alpha + \beta X \]

This regression works generally well, even if \(\small y\) is not normally distributed. (The residuals should be though. And a posterior predictive plot is always useful!)

Distributions

However, for some outcome types a linear regression simply does not work. This is particularity clear when the outcome is bound to be equal to or larger than zero (like counts that “clump” at zero) or when the outcome is binary.

As an example the next figure shows end of year math grades in a Portuguese school.1.

df=read.table("data/student-mat.csv",sep=";",header=TRUE)
df = df[df$Medu>0,]
set_par()
x = barplot(c(table(df$G3),0,0,0), xaxt = "n", border = "grey")
axis(1,at = x[seq(0,20,2)+1], labels = seq(0,20,2))

In these situations we can use this kind of model:

\[ y \sim dist(\theta_1,\theta_2) \\ \]

Here \(\small dist(\theta_1,\theta_2)\) is a distribution with two parameters2 that is consistent with the observed data.

Choosing a different distribution means choosing a different likelihood function. That is, we exchange dnomr with an alternative distribution that is appropriate for the data.

Here are a few examples:

  • The Binomial distribution models the “number of successes in a sequence of n independent experiments, each asking a yes–no question”. Hence we use the Binomial distribution function to calculate the likelihood of the data given model and parameters when we model number of successes. A special case is when we have only one trial / experiment, the Binomial distribution models binary outcomes.
set_par()
x = seq(0,5,1)
plot(x-.2,dbinom(x, 5, .2),"h", xlim = c(0,20), lwd = 2,
     ylab = "probability mass", xlab = "number of success")
x = seq(0,10,1)
lines(x,dbinom(x, 10, .5),"h", col = "blue", lwd = 2)
x = seq(0,20,1)
lines(x+.2,dbinom(x, 20, .75),"h", col = "red", lwd = 2)
legend("topright",
       lwd = 2, col = c("black","blue","red"),
       legend = c(
         expression(theta[1]~" = n = 5, "~theta[2]~" = p = .2"),
         expression(theta[1]~" = n = 10, "~theta[2]~" = p = .5"),
         expression(theta[1]~" = n = 30, "~theta[2]~" = p = .75")
       ),
       bty = "n")

  • The Poisson distribution “expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate” . Hence we use the Poisson distribution function to calculate the likelihood of the data given model and parameters when we occurrence of events (counts).
set_par()
x = seq(0,5,1)
plot(x-.2,dpois(x, .5),"h", xlim = c(0,20), lwd = 2,
     ylab = "probability mass", xlab = "number of events")
x = seq(0,10,1)
lines(x,dpois(x, 5),"h", col = "blue", lwd = 2)
x = seq(0,20,1)
lines(x+.2,dpois(x, 10),"h", col = "red", lwd = 2)
legend("topright",
       lwd = 2, col = c("black","blue","red"),
       legend = c(
         expression(theta[1]~" = "~lambda~" = p =  .5"),
         expression(theta[1]~" = "~lambda~" = p = 5"),
         expression(theta[1]~" = "~lambda~" = p = 10")
       ),
       bty = "n")

Logistic regression

For a hands on example for logistic regression we will try to predict failing the math class with the Portugese school data shown above. In particular, we will use these predictors:

It is always a good idea to plot the data first:

data.list = list(
  Medu = df$Medu,
  failures = df$failures,
  fail = df$G3 == 0
) 
set_par(mfrow = c(1,2))
table(data.list$failures) %>% barplot(border = "grey",   xlab = "number fails")
table(data.list$Medu) %>% barplot(border = "grey", ylab = "count", xlab = "maternal edu")

We are centering Medu, so that the intercept measures odds of “middle high” maternal education.

data.list$cMedu = data.list$Medu-2.5

We begin the analysis with an intercept model only. Especially if one is not familiar with a model, it is good to start simple and add complexity step by step.

model = alist(
  fail ~ dbinom(1,p),
  logit(p) <- a,
  a ~ dnorm(0,10)
)

Prior predictive check

And we estimate the model with ulam (i.e. Stan):

u.fit_I = ulam(
  model,
  data = data.list,
  iter = 2000,      # 2000 iterations, (1000 warmup)
  chains = 4,       # four chains, to check convergence
  cores = 4,        # use 4 cores in parallel
  cmdstan = TRUE)   # use cmdstanr not rstan

We extract the prior and plot the prior prediction for the failure rate:

prior = extract.prior(u.fit_I)
set_par()
hist(inv_logit(prior$a), main="", breaks = 30)

To see what is going on here, lets overlay the logistic function on the prior:

set_par(mfrow = c(1,2), mar=c(3,3,3,3), cex = 1)

curve(dnorm(x,0,10),-30,30, ylab = "prior density", xlab = "a")
axis(4,at = seq(0,0.04, length.out = 5),
     labels = seq(0,1,length.out = 5), col = "red")
curve(inv_logit(x)/25, add = T, col = "red")
abline(h = .1/25, lty = 3, col = "grey")
title("a ~ dnorm(0,10)")

curve(dnorm(x,0,2),-6,6, ylab = "prior density", xlab = "a")
axis(4,at = seq(0,0.2, length.out = 5),
     labels = seq(0,1,length.out = 5), col = "red")
curve(inv_logit(x)/5, add = T, col = "red")
abline(h = .1/5, lty = 3, col = "grey")
title("a ~ dnorm(0,2)")

With a wide prior, most of the probability mass is either smaller than -5, leading to p very close to 0, or larger than 5, leading to p close to 1. Hence we will use a tighter prior on a.

How about the regression coefficients \(\small beta\)? Regression coefficients are log odds ratios.

For instance, lets assume the following:

  • for children who never failed a class, 2 out of 100 children fail the class now
  • for children who once failed a class, 4 out of 100 children fail the class

then the odds ratio is :

\[ \begin{align*} OR = & \frac{\textrm{failure odds with no prior fail}}{\textrm{failure odds with prior fail}} \\ = & \frac{\frac{2}{98}}{\frac{4}{96}} = \frac{.0204}{.0417} = .49 \end{align*} \]

Note that this specific odds ratio comes close to the risk ratio of \(\small .2 / .4 = .5\). This is, however, not generally the case. If the probabilities are far away from zero, risk ratio and odds ratio do not align! We can see this if we just multiply the probability of failure by 15 in both groups: \[ \begin{align*} OR = \frac{\frac{30}{70}}{\frac{60}{40}} = \frac{.428}{.667} = .29 \end{align*} \]

Lets get back to specifying our prior for \(\small \beta\):

  • We had an odds ration of 0.49, which corresponds to a log odds ratio of log(.49) ~ -.7.
  • This means that if we assume that probability is low and a shift from no to one prior fail comes with a doubling of the fail probability, we should comfortably allow \(\small \beta\) values of size .7.
  • If we wanted to set an informative shrinkage prior, which however does not prefer one direction of the effect, we could set a dnorm(0,.7) prior.
  • But we are not so sure and don’t like to informative priors, so we set a dnorm(0,2) prior.

Lets specify such a model and look at the prior predictions for the difference between two levels of education.

model = alist(
  fail ~ dbinom(1,p),
  logit(p) <- a + b1*failures,
  a ~ dnorm(0,2),
  b1 ~ dnorm(0,2)
)
u.fit_F = ulam(
  model,
  data = data.list,
  iter = 2000,      # 2000 iterations, (1000 warmup)
  chains = 4,       # four chains, to check convergence
  cores = 4,        # use 4 cores in parallel
  cmdstan = TRUE)   # use cmdstanr not rstan

prior = extract.prior(u.fit_F)

Now we can use the link_ulam function and new data to generate predictions from the prior.

First we look at the effect of a one level change of education.

p = link_ulam(
  u.fit_F, post = prior, 
  data = list(failures = c(0,1)))
set_par(mfrow = c(1,2), mar=c(3,3,3,3), cex = 1)
hist(p[,2]-p[,1],
     main = "risk difference", xlab = "P(fail|high) - P(fail|low)")
hist(p[,2]/p[,1], breaks = 30,
     main = "risk ratio", xlab = "P(fail|high) / P(fail|low)")

These differences and ratios are pretty large, so it is safe to make the prior a bit narrower and set the standard deviation for the regression weights to 1.

Here is the our model so far, which include previous class fails and maternal education as predictors:

model = alist(
  fail ~ dbinom(1,p),
  logit(p) <- a + b1*failures + b2*cMedu,
  a ~ dnorm(0,2),
  b1 ~ dnorm(0,1),
  b2 ~ dnorm(0,1)
)
u.fit_FE = ulam(
  model,
  data = data.list,
  log_lik = TRUE,   # for model comparison
  iter = 2000,      # 2000 iterations, (1000 warmup)
  chains = 4,       # four chains, to check convergence
  cores = 4,        # use 4 cores in parallel
  cmdstan = TRUE)   # use cmdstanr not rstan

Some quick convergence diagnostics:

precis(u.fit_FE, depth = 2)
##          mean        sd       5.5%       94.5%    n_eff    Rhat4
## a  -2.5825926 0.2208076 -2.9475630 -2.24703945 2012.444 1.001705
## b1  0.6923649 0.1726043  0.4124413  0.96734347 2240.652 1.002301
## b2 -0.3014267 0.1659922 -0.5661254 -0.03307494 3014.866 1.000280

Which predictions one generates from the prior depends on research questions and domain knowlegde. My approach is to check if prior-predicted data are broadly consistent with the overall distribution of the data and to check if effect estimnates fall into a plausible range. The latter is easier for gaussian models and harder for models with exponential link functions. One simple rule of thumb for logistic or binomial regressions is that with a \(\small Normal(0,1)\) prior on regression coefficients the the 97.5 percentile is 1.96. If we exponentiate this we get 7, wich means that a \(\small Normal(0,1)\) prior expresses the pior information that the OR is probably not much larger then 7. Note that this prior would still be overwhelmed if we have a reasonable amount (N>25)3 of data.

Posterior predictive checks

To see if our model describes the data well, we plot model predicted and observed data together:

aggr.fun = function(x) return(c(mean = mean(x), N = length(x)))
obs = 
  aggregate(
  data.list$fail,
  by = with(data.list, 
            data.frame(cMedu = cMedu, Medu = Medu, failures = failures)), 
  aggr.fun
  )
obs = cbind(obs,obs$x)

sim.data = 
  unique(as.data.frame(data.list)[,c(2,4)])

p = link_ulam(
  u.fit_FE,
  data = sim.data)

pp.stats = cbind(
  sim.data,
  m = colMeans(p),
  t(apply(p,2,PI))
)

set_par(mfrow = c(2,2), mar=c(3,3,1.5,0.5), cex = .75)
tmp = 
  lapply(unique(obs$cMedu), function(x) {
  plot(obs[obs$cMedu == x,"failures"],
       obs[obs$cMedu == x,"mean"],
       cex = sqrt(obs[obs$cMedu == x,"N"]),
       xlim = c(-.5,3.5), ylim = c(0,.5),
       ylab = "proportion fail",
       xlab = "# past failures",
       xaxt = "n", pch = 16, col = "grey")
    title(paste("maternal edu",x+2.5),line = 0.5, cex.main = 1)
    axis(1,at = 0:3)
    points(pp.stats[pp.stats$cMedu == x,"failures"],
       pp.stats[pp.stats$cMedu == x,"m"], pch = 16, col = "blue")
    arrows(pp.stats[pp.stats$cMedu == x,"failures"],
           y0 = pp.stats[pp.stats$cMedu == x,"5%"],
           y1 = pp.stats[pp.stats$cMedu == x,"94%"],
           length = 0, col = "blue")
})

This does not look great. We can try to improve the model in two ways:

  • give each level of maternal education its own intercept
  • give each level of maternal education its own slope

We are using the indexing approach discusses in the book for this.

model = alist(
  fail ~ dbinom(1,p),
  logit(p) <- a[Medu] + b[Medu]*failures,
  a[Medu] ~ dnorm(0,2),
  b[Medu] ~ dnorm(0,1)
)
u.fit_FE.2 = ulam(
  model,
  data = data.list,
  log_lik = TRUE,   # for model comparison
  iter = 2000,      # 2000 iterations, (1000 warmup)
  chains = 4,       # four chains, to check convergence
  cores = 4,        # use 4 cores in parallel
  cmdstan = TRUE)   # use cmdstanr not rstan

Did the chains converge?

precis(u.fit_FE.2, depth = 2) %>% round(2)
##       mean   sd  5.5% 94.5%   n_eff Rhat4
## a[1] -2.17 0.47 -2.94 -1.44 2831.32     1
## a[2] -2.10 0.33 -2.66 -1.60 3402.34     1
## a[3] -2.85 0.45 -3.60 -2.18 2801.48     1
## a[4] -3.01 0.41 -3.70 -2.40 4075.78     1
## b[1]  0.53 0.31  0.04  1.00 2737.27     1
## b[2]  0.59 0.32  0.08  1.10 3472.30     1
## b[3]  0.82 0.31  0.32  1.31 3019.55     1
## b[4]  0.01 0.74 -1.26  1.11 3258.11     1

This looks OK, so we do again a posterior predictive check:

sim.data = 
  unique(as.data.frame(data.list)[,c(1,2)]) 

p = link_ulam(
  u.fit_FE.2,
  data = sim.data)

pp.stats = cbind(
  sim.data,
  m = colMeans(p),
  t(apply(p,2,PI))
)

set_par(mfrow = c(2,2), mar=c(3,3,1.5,0.5), cex = .75)
tmp = 
  lapply(unique(obs$Medu), function(x) {
  plot(obs[obs$Medu == x,"failures"],
       obs[obs$Medu == x,"mean"],
       cex = sqrt(obs[obs$Medu == x,"N"]),
       xlim = c(-.5,3.5), ylim = c(0,.5),
       ylab = "proportion fail",
       xlab = "# past failures",
       xaxt = "n", pch = 16, col = "grey")
    title(paste("maternal edu",x),line = 0.5, cex.main = 1)
    axis(1,at = 0:3)
    points(pp.stats[pp.stats$Medu == x,"failures"],
       pp.stats[pp.stats$Medu == x,"m"], pch = 16, col = "blue")
    arrows(pp.stats[pp.stats$Medu == x,"failures"],
           y0 = pp.stats[pp.stats$Medu == x,"5%"],
           y1 = pp.stats[pp.stats$Medu == x,"94%"],
           length = 0, col = "blue")
})

This does not look that much better. Let’s check this with PSIS-LOO:

compare(u.fit_FE.2,u.fit_FE, func = "PSIS")
##                PSIS       SE    dPSIS      dSE    pPSIS     weight
## u.fit_FE   232.7150 25.44361 0.000000       NA 2.904342 0.97496306
## u.fit_FE.2 240.0391 25.55773 7.324094 3.263811 7.150832 0.02503694

In this case the added complexity has not helped much. So lets look at the parameters of the first model:

plot(precis(u.fit_FE, depth = 2))

precis(u.fit_FE, depth = 2) %>% round(2)
##     mean   sd  5.5% 94.5%   n_eff Rhat4
## a  -2.58 0.22 -2.95 -2.25 2012.44     1
## b1  0.69 0.17  0.41  0.97 2240.65     1
## b2 -0.30 0.17 -0.57 -0.03 3014.87     1

What does this all mean?

Because the parameters are log-odds ratios, we exponentiate them to get ORs. Because the probability of our outcome is (relatively) close to zero, we can interprete the OR as approximate risk ratios. Hence

  • the intercept value is -2.58, which gives a baseline risk of 0.075774
  • the regression weight for failures is 0.69, which means that the odds ratio and approximate relative risk to fail the class if one has failed it once before is exp(0.69) 1.99.
  • the regression weight for maternal education is -0.3, which means that the odds ratio and approximate relative risk to fail the class if one has failed it once before is exp(-0.3) = 0.74. Correspondingly, if maternal education is changed by two levels, the OR (~RR) is exp(2 * -0.3) = 0.55

Constrasts

The OR and relative risks that we have to deal with due to the link function make it hard to interpret the results. But if we create posterior predictions we can simply calculate contrasts.

Lets look at the data we used to generate posterior predictions:

sim.data = 
  unique(as.data.frame(u.fit_FE@data)[, c("cMedu","failures","Medu")]) 
sim.data = sim.data[order(sim.data$cMedu,sim.data$failures),]
rownames(sim.data) = NULL
sim.data
##    cMedu failures Medu
## 1   -1.5        0    1
## 2   -1.5        1    1
## 3   -1.5        2    1
## 4   -1.5        3    1
## 5   -0.5        0    2
## 6   -0.5        1    2
## 7   -0.5        2    2
## 8   -0.5        3    2
## 9    0.5        0    3
## 10   0.5        1    3
## 11   0.5        2    3
## 12   0.5        3    3
## 13   1.5        0    4
## 14   1.5        1    4
## 15   1.5        2    4

If we now generate the posterior predictions, we get a matrix in which each column corresponds to one row in sim.data

pp = link_ulam(
  u.fit_FE,
  data = sim.data)
dim(pp)
## [1] 4000   15

A simple question would be what the effect of having one vs zero previous failures is when maternal educational level is 1. We simply look up in the sim.data table that we need to compare columns one and two in the posterior predictions for this. In the same manner we can do this for maternal educational level 2.

# effect of prior failure for Medu = 1
delta.L1 = pp[,2] - pp[, 1]
# effect of prior failure for Medu = 2
delta.L2 = pp[,6] - pp[, 5]
# plot data
xlim = range(c(delta.L1,delta.L2))
breaks = seq(xlim[1]-.01,xlim[2]+.01, length.out = 25)
set_par(mfrow = c(1,3), cex = 1.5)
hist(delta.L1, main = "", xlim = xlim, breaks = breaks)
abline(v = 0, col = "red", lty = 3)
hist(delta.L2, main = "", xlim = xlim, breaks = breaks)
abline(v = 0, col = "red", lty = 3)
hist(delta.L2-delta.L1, main = "")
abline(v = 0, col = "red", lty = 3)

The bottom histogram shows an interaction effect on the scale of risk differences, even though our model has no interaction term! This is due to the exponent in the link function!

This interaction effect can also be understood by remembering that odds ratios are relative effects, i.e. the effect of a one unit change of a variable depends on the probability of success at the comparison value. Here is an example with numbers. Lets assume our intercepts for two groups are 1 and 2, respecitively, and the log-odds ration for the effect of interest is 1. Then we can calculate the risk differences as inv_logit(1+1) - inv_logit(1+0) = 0.15 and inv_logit(2+1) - inv_logit(2+0) = 0.07, and see that the absolute effect of the logg odds ratio depends on the starting point. Calculate inv_logit(1+1) / inv_logit(1+0) and inv_logit(2+1) / inv_logit(2+0) and compare the results. Therefore, the meaningfulness of a result in log odds ratio can only be understood based on some background knnowledge. Note that the dependence of the absolute effect on the comparison value also explains the “interaction with oneself” described in the chapter.

Here is another example: If we want to see the difference in fail probability between maternal education levels 2 and 3, we calculate one vector of samples with fail probabilities at level 1, one for level 4, and subtract them:

fail.level1 = rowMeans(pp[,1:4])
fail.level4 = rowMeans(pp[,13:15])
delta = fail.level4 - fail.level1
set_par( mar=c(3,3,.5,.5), cex = 1.5)
hist(delta, breaks = 25, main = "")

However, this analysis assumed that the pupils with different number of previously failed classes are equally frequent, which is not the case. So we need to weight with these probabilities:

M1_p.failures = prop.table(table(df$failures[df$Medu == 1]))
M4_p.failures = prop.table(table(df$failures[df$Medu == 4]))
fail.level1 = pp[,1:4] %*% M1_p.failures 
fail.level4 = pp[,13:15] %*% M4_p.failures 
delta = fail.level4 - fail.level1
set_par( mar=c(3,3,.5,.5), cex = 1.5)
hist(delta, breaks = 25, main = "")

This are just two examples. Every calculation we can make based on the posterior prediction is legal!

Aggregated binomial regression

The binomial regression is the general case of which the logistic regression is a special case. Both have a logistic link function, but whereas logistic regressions estimates if one trial was a “success”, binomial regressions estimate how many out of N trials were successes.

To look at the aggregated binomial regression we will look at data from Norwegian tests and ask the questions “Does the number of excused children depend on the topic English, reading or Math?”

Here is the data:

load(file = "data/NasjonalTests5.Rdata")

NatTests5 
##                   Fylke topic N_total N_part N_Miss N_Exc Fylke_Nr topic_Nr
## 1                 Viken   Eng   15426  14399    209   818        1        1
## 2                 Viken  Math   15426  14463    207   756        1        2
## 3                 Viken  Read   15426  14392    155   879        1        3
## 4              Vestland   Eng    7684   7128    103   453        2        1
## 5              Vestland  Math    7684   7212     88   384        2        2
## 6              Vestland  Read    7684   7115    108   461        2        3
## 7                  Oslo   Eng    6978   6455    153   370        3        1
## 8                  Oslo  Math    6978   6399    153   426        3        2
## 9                  Oslo  Read    6978   6406    125   447        3        3
## 10             Rogaland   Eng    6451   6038     78   335        4        1
## 11             Rogaland  Math    6451   6048     74   329        4        2
## 12             Rogaland  Read    6451   5971     80   400        4        3
## 13            Trøndelag   Eng    5464   5000    109   355        5        1
## 14            Trøndelag  Math    5464   5030    117   317        5        2
## 15            Trøndelag  Read    5464   5004    110   350        5        3
## 16 Vestfold og Telemark   Eng    4863   4484     58   321        6        1
## 17 Vestfold og Telemark  Math    4863   4543     53   267        6        2
## 18 Vestfold og Telemark  Read    4863   4483     54   326        6        3
## 19                Agder   Eng    3911   3621     63   227        7        1
## 20                Agder  Math    3911   3650     62   199        7        2
## 21                Agder  Read    3911   3618     54   239        7        3
## 22            Innlandet   Eng    3804   3497     75   232        8        1
## 23            Innlandet  Math    3804   3526     65   213        8        2
## 24            Innlandet  Read    3804   3502     81   221        8        3
## 25      Møre og Romsdal   Eng    3185   2908     25   252        9        1
## 26      Møre og Romsdal  Math    3185   2934     28   223        9        2
## 27      Møre og Romsdal  Read    3185   2865     33   287        9        3
## 28             Nordland   Eng    2645   2394     74   177       10        1
## 29             Nordland  Math    2645   2420     74   151       10        2
## 30             Nordland  Read    2645   2375     85   185       10        3
## 31    Troms og Finnmark   Eng    2607   2364     58   185       11        1
## 32    Troms og Finnmark  Math    2607   2380     60   167       11        2
## 33    Troms og Finnmark  Read    2607   2297    112   198       11        3

To set up a binomial model, we set up a model that

Here is the model:

NT.model = 
  alist(
    N_Exc ~ dbinom(N_total, p),
    logit(p) <- a[Fylke_Nr] + b[topic_Nr],
    a[Fylke_Nr] ~ dnorm(0,1),
    b[topic_Nr] ~ dnorm(0,1))

Here we are fitting the model.

u.fit_NT = ulam(
  NT.model,
  data = NatTests5,
  log_lik = TRUE,   # for model comparison
  iter = 2000,      # 2000 iterations, (1000 warmup)
  chains = 4,       # four chains, to check convergence
  cores = 4,        # use 4 cores in parallel
  cmdstan = TRUE)   # use cmdstanr not rstan
precis(u.fit_NT, depth = 2) %>% round(2)

n_eff and Rhat don’t look great. We look at traceplots for the first a and b to figure out what is going on:

library(posterior)
library(bayesplot)
draws = u.fit_NT@stanfit %>% as_draws()
draws %>% subset_draws(c("a[1]","b[1]")) %>% mcmc_trace()

The chains are not mixed as well as one would hope, and it looks a bit like the chains are correlated. We examine this with a pairs plot:

draws %>% subset_draws(c("a[1]","b[1]")) %>% mcmc_pairs()

The strong negative correlation reminds us that the Fylke intercept actually already represents the proportion of excused pupils in the first (any one of the) topics. WE have to re-formulate the model and use only two of the topics.

We update our data so that there is one dummy variable for Math and one for English tests.

NatTests5$Math = 1*(NatTests5$topic == "Math")
NatTests5$Eng = 1*(NatTests5$topic == "Eng")

And we formulate and estimate an updated model:

NT.model2 = 
  alist(
    N_Exc ~ dbinom(N_total, p),
    logit(p) <- a[Fylke_Nr] + b1*Math + b2*Eng,
    a[Fylke_Nr] ~ dnorm(0,1),
    b1 ~ dnorm(0,1),
    b2 ~ dnorm(0,1))

u.fit_NT2 = ulam(
  NT.model2,
  data = NatTests5,
  log_lik = TRUE,   # for model comparison
  iter = 2000,      # 2000 iterations, (1000 warmup)
  chains = 4,       # four chains, to check convergence
  cores = 4,        # use 4 cores in parallel
  cmdstan = TRUE)   # use cmdstanr not rstan
precis(u.fit_NT2, depth = 2) %>% round(2)

This looks much better.

Here is a posterior predictive plot.

NatTests5$obs = NatTests5$N_Exc/NatTests5$N_total
NatTests5$x = NatTests5$Fylke_Nr + (NatTests5$topic_Nr-2)/8

set_par(mar=c(3,1,.5,.5))

with(NatTests5, 
     plot(obs, rev(x), cex = 2, col = topic_Nr,
          pch = 16, xlim = c(0,.1),
          yaxt = "n", ylab = "", xlab = "proportion excused")
     )

with(NatTests5[NatTests5$topic == "Math",],
     text(0, rev(x), Fylke, pos = 4)
     )

legend("topright", col = 1:3, 
       legend = c("English","Math","Reading"), bty = "n", pch = 16)

pp = link(u.fit_NT2)
CI = apply(pp,2,PI)
arrows(y0 = rev(NatTests5$x), x0 = CI[1,], x1 = CI[2,], col = "blue", length = 0)
points(colMeans(pp), rev(NatTests5$x), cex = 1.25,
       col = "blue", pch = 21, bg = "white")

What do the coefficients tell us?

Here is a plot with the log odds to be excused from the reading test by Fylke:

plot(precis(u.fit_NT2, depth = 2, pars = "a"))
yy = unique(NatTests5[,c("Fylke","Fylke_Nr")])
text(-2.4,rev(yy$Fylke_Nr), yy$Fylke,pos = 3)

Regarding the effect of topic we find that

plot(precis(u.fit_NT2, pars = c("b1","b2")))


  1. P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. The data can be downloaded here↩︎

  2. There are also distributions with 1 or 3 or more parameters↩︎

  3. I plugged this number from thin air, should really to a quick calculation on this.↩︎